Constrained Output Embeddings for End-to-End Code-Switching Speech Recognition with Only Monolingual Data
The lack of code-switch training data is one of the major concerns in the
development of end-to-end code-switching automatic speech recognition (ASR)
models. In this work, we propose a method to train an improved end-to-end
code-switching ASR model using only monolingual data. Our method encourages the
distributions of the output token embeddings of the monolingual languages to be
similar, which in turn enables the ASR model to code-switch between languages
more easily. Specifically, we propose to use Jensen-Shannon divergence and cosine
distance-based constraints. The former enforces the output embeddings of the
monolingual languages to follow similar distributions, while the latter simply
brings the centroids of the two distributions close to each other.
Experimental results demonstrate the high effectiveness of the proposed method,
yielding up to 4.5% absolute mixed error rate improvement on a Mandarin-English
code-switching ASR task.
Comment: 5 pages, 3 figures, accepted to INTERSPEECH 201
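The two constraints described above can be sketched numerically: Jensen-Shannon divergence between two discrete distributions, and cosine distance between the centroids of two embedding sets. The softmax proxy for turning scores into distributions and the toy vectors are illustrative assumptions, not the paper's exact formulation.

```python
import math

def softmax(xs):
    # Turn raw scores into a probability distribution (illustrative proxy).
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # Kullback-Leibler divergence KL(p || q) for discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    # Symmetric Jensen-Shannon divergence: average KL to the mixture.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def centroid(vectors):
    # Mean vector of a set of embeddings.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)
```

A JS term would pull the two languages' embedding distributions together, while the cosine term only pulls their centroids together, which is the weaker of the two constraints.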
Cloud-based Automatic Speech Recognition Systems for Southeast Asian Languages
This paper provides an overall introduction of our Automatic Speech
Recognition (ASR) systems for Southeast Asian languages. As little existing
work has been carried out on such regional languages, several difficulties must
be addressed before building the systems: limited speech and text
resources, lack of linguistic knowledge, etc. This work takes Bahasa Indonesia
and Thai as examples to illustrate the strategies of collecting various
resources required for building ASR systems.
Comment: Published by the 2017 IEEE International Conference on Orange Technologies (ICOT 2017)
Independent language modeling architecture for end-to-end ASR
The attention-based end-to-end (E2E) automatic speech recognition (ASR)
architecture allows for joint optimization of acoustic and language models
within a single network. However, in a vanilla E2E ASR architecture, the
decoder sub-network (subnet), which incorporates the role of the language model
(LM), is conditioned on the encoder output. This means that the acoustic
encoder and the language model are entangled, which does not allow the
language model to be trained separately on external text data. To address this
problem, in
this work, we propose a new architecture that separates the decoder subnet from
the encoder output. In this way, the decoupled subnet becomes an independently
trainable LM subnet, which can easily be updated using the external text data.
We study two strategies for updating the new architecture. Experimental results
show that, 1) the independent LM architecture benefits from external text data,
achieving 9.3% and 22.8% relative character and word error rate reduction on
Mandarin HKUST and English NSC datasets respectively; 2) the proposed
architecture works well with external LMs and can be generalized to different
amounts of labelled data.
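The decoupling described above can be sketched in miniature: a stand-in LM branch that sees only the token history (and so can be retrained on text alone) is combined log-linearly with a stand-in acoustic branch that sees only the encoder output. The tables, weights, and fusion scheme below are illustrative assumptions, not the paper's exact design.

```python
import math

# Toy token inventory.
VOCAB = ["a", "b", "</s>"]

def log_softmax(scores):
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def lm_scores(prev_token):
    """Stand-in LM subnet: depends only on the token history, so it can
    be updated independently using external text data."""
    table = {"<s>": [2.0, 0.5, 0.1], "a": [0.2, 1.5, 0.3], "b": [0.5, 0.2, 2.0]}
    return log_softmax(table[prev_token])

def acoustic_scores(frame):
    """Stand-in acoustic branch: depends only on the encoder output."""
    return log_softmax(frame)

def fused_scores(frame, prev_token, lm_weight=0.3):
    # Log-linear combination of the two decoupled branches.
    return [a + lm_weight * l
            for a, l in zip(acoustic_scores(frame), lm_scores(prev_token))]
```

Because the two branches only meet at the final score combination, swapping in a better-trained LM branch never requires touching the acoustic side.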
Contrastive Speech Mixup for Low-resource Keyword Spotting
Most of the existing neural-based models for keyword spotting (KWS) in smart
devices require thousands of training samples to learn a decent audio
representation. However, with the rising demand for smart devices to become
more personalized, KWS models need to adapt quickly to smaller user samples. To
tackle this challenge, we propose a contrastive speech mixup (CosMix) learning
algorithm for low-resource KWS. CosMix introduces an auxiliary contrastive loss
to the existing mixup augmentation technique to maximize the relative
similarity between the original pre-mixed samples and the augmented samples.
The goal is to inject enhancing constraints to guide the model towards simpler
but richer content-based speech representations from two augmented views (i.e.
noisy mixed and clean pre-mixed utterances). We conduct our experiments on the
Google Speech Command dataset, where we trim the size of the training set to as
small as 2.5 mins per keyword to simulate a low-resource condition. Our
experimental results show a consistent improvement across multiple models,
demonstrating the effectiveness of our method.
Comment: Accepted by ICASSP 202
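The core of the approach above, mixup augmentation plus an auxiliary contrastive loss that pulls the mixed view towards its pre-mixed sources, can be sketched as follows. The InfoNCE-style loss, temperature, and toy vectors are illustrative assumptions rather than the exact CosMix objective.

```python
import math

def mixup(x1, x2, lam):
    """Mixup augmentation: convex combination of two feature vectors."""
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def contrastive_loss(mixed, positives, negatives, temp=0.1):
    """InfoNCE-style loss: maximize similarity between the mixed view and
    its pre-mixed sources, relative to other samples in the batch."""
    pos = sum(math.exp(cosine(mixed, p) / temp) for p in positives)
    neg = sum(math.exp(cosine(mixed, n) / temp) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss shrinks towards zero as the mixed representation aligns with its pre-mixed sources and moves away from the negatives.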
Are Soft Prompts Good Zero-shot Learners for Speech Recognition?
Large self-supervised pre-trained speech models require computationally
expensive fine-tuning for downstream tasks. Soft prompt tuning offers a simple
parameter-efficient alternative by utilizing minimal soft prompt guidance,
enhancing portability while also maintaining competitive performance. However,
how and why soft prompts work remains poorly understood. In this study, we aim
to deepen our understanding of this emerging method by investigating the role
of soft prompts in automatic speech recognition (ASR). Our findings highlight
their role as zero-shot learners in improving ASR performance, but also reveal
their vulnerability to malicious modifications. Soft prompts aid generalization but
are not obligatory for inference. We also identify two primary roles of soft
prompts: content refinement and noise information enhancement, which enhances
robustness against background noise. Additionally, we propose an effective
modification of the noise prompts to show that they are capable of zero-shot
adaptation to out-of-distribution noise environments.
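Soft prompt tuning, as discussed above, reduces to prepending a small set of trainable vectors to the input of a frozen backbone, so that only the prompts are updated. The sizes, the zero initialization, and the mean-pooling stand-in for the frozen model are illustrative assumptions.

```python
PROMPT_LEN, DIM = 4, 8

def init_soft_prompts():
    # Trainable prompt vectors (zeros here; normally randomly initialized).
    return [[0.0] * DIM for _ in range(PROMPT_LEN)]

def prepend_prompts(prompts, features):
    """Model input becomes [soft prompts ; acoustic feature frames]."""
    return prompts + features

def frozen_model(sequence):
    """Stand-in frozen backbone: mean-pools the whole sequence. Its
    'parameters' never change; only the prompts are trained."""
    n = len(sequence)
    return [sum(col) / n for col in zip(*sequence)]
```

This is what makes the method parameter-efficient: the trainable state is PROMPT_LEN x DIM values, regardless of the backbone's size, and the same prompts can be dropped or swapped at inference time.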
SPGM: Prioritizing Local Features for Enhanced Speech Separation Performance
Dual-path is a popular architecture for speech separation models (e.g.
Sepformer) which splits long sequences into overlapping chunks for its intra-
and inter-blocks that separately model intra-chunk local features and
inter-chunk global relationships. However, it has been found that inter-blocks,
which comprise half a dual-path model's parameters, contribute minimally to
performance. Thus, we propose the Single-Path Global Modulation (SPGM) block to
replace inter-blocks. SPGM is named after its structure consisting of a
parameter-free global pooling module followed by a modulation module comprising
only 2% of the model's total parameters. The SPGM block allows all transformer
layers in the model to be dedicated to local feature modelling, making the
overall model single-path. SPGM achieves 22.1 dB SI-SDRi on WSJ0-2Mix and 20.4
dB SI-SDRi on Libri2Mix, exceeding the performance of Sepformer by 0.5 dB and
0.3 dB respectively, and matching the performance of recent SOTA models with up
to 8 times fewer parameters.
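The SPGM structure described above, a parameter-free global pooling module followed by a lightweight modulation module, can be sketched as below. The sigmoid gating form and the per-dimension gate weights are illustrative assumptions about the modulation module, not the paper's exact layer.

```python
import math

def global_pool(frames):
    """Parameter-free global pooling: mean over all time frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def modulate(frames, summary, gate_weights):
    """Modulation: scale each local frame by a sigmoid gate computed from
    the pooled global summary (gate_weights are the only learned
    parameters in this sketch)."""
    gate = [1.0 / (1.0 + math.exp(-(w * s)))
            for w, s in zip(gate_weights, summary)]
    return [[g * f for g, f in zip(gate, frame)] for frame in frames]
```

Since the pooling has no parameters and the gate is tiny, nearly all model capacity stays in the local (intra-chunk) transformer layers, which is the point of replacing the inter-blocks.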
The NNi Vietnamese speech recognition system for MediaEval 2016
This paper provides an overall description of the Vietnamese speech recognition system developed by the joint team for MediaEval 2016. The submitted system consisted of 3 subsystems, and adopted different deep neural network-based techniques such as fMLLR-transformed bottleneck features, sequence training, etc. Besides the acoustic modeling techniques, speech data augmentation was also examined to develop a more robust acoustic model. The I2R team collected a number of text resources from the Internet and made them available to other participants in the task. The web text crawled from the Internet was used to train a 5-gram language model. The submitted system obtained token error rates (TER) of 15.1, 23.0 and 50.5 on the Devel local set, Devel set and Test set, respectively.
Comment: Published version
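The token error rate (TER) reported above is the standard Levenshtein edit distance between reference and hypothesis token sequences, normalized by the reference length. A minimal dynamic-programming sketch:

```python
def token_error_rate(reference, hypothesis):
    """TER (%) = (substitutions + deletions + insertions) / reference
    length, via standard Levenshtein dynamic programming over tokens."""
    r, h = reference, hypothesis
    # dp[i][j]: edit distance between r[:i] and h[:j].
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return 100.0 * dp[len(r)][len(h)] / len(r)
```

Production scoring tools additionally normalize text (case, punctuation, segmentation) before alignment, which this sketch omits.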